10-mlr

Author

Professor Shannon Ellis

Published

February 14, 2023

Multiple Linear Regression

Q&A

Q: I didn’t know we don’t have to complete all parts of the lab to get credit - how does it work if we’re graded on effort? Do we still get a full score?
A: We grade on concerted effort. Meaning, we look at everyone’s labs, and using a rubric relative to what we expect a student could complete in ~1h of work, we grade and give full credit if it looks like at least an hour of time was spent. Now, sometimes it’s hard to judge this as students work at different speeds. If you ever receive less than full credit (2) but feel you made a concerted effort that week in lab, reach out and we’ll chat about it!

Q: I am still confused about when we are supposed to use jitter on our scatterplot.
A: When points on a plot are on top of one another and you can’t tell how many observations are actually plotted, jittering is something to consider. You can alternatively use transparency (alpha) or scale the points relative to how many observations are present at each point. There’s no “you must jitter now” rule, but there are times to consider it.

Course Announcements

Due Dates:

  • Lab 05 due Friday (2/17; 11:59 PM)
  • mid-course survey (optional for EC) due Fri (2/17; 11:59 PM)
  • Lecture Participation survey “due” after class
  • HW03 will be available tomorrow (Wed); due (2/27) <- that’s a change

Agenda

  • Multiple Linear Regression
    • Multiple predictors
    • Main vs interaction effects
    • Model comparison
    • Backward selection

Packages & Data

library(tidyverse)
library(tidymodels)

Data: Paris Paintings

pp <- read_csv("https://raw.githubusercontent.com/COGS137/datasets/main/paris_paintings.csv", 
               na = c("n/a", "", "NA")) |> 
  mutate(log_price = log(price))
  • Number of observations: 3393
  • Number of variables: 62

Two numerical predictors

Multiple predictors

  • Response variable: log_price
  • Explanatory variables: Width and height
lin_mod <- linear_reg() |>
  set_engine("lm")

pp_fit <- lin_mod |>
  fit(log_price ~ Width_in + Height_in, data = pp)
tidy(pp_fit)
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   4.77     0.0579      82.4  0       
2 Width_in      0.0269   0.00373      7.22 6.58e-13
3 Height_in    -0.0133   0.00395     -3.36 7.93e- 4

Linear model with multiple predictors

# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   4.77     0.0579      82.4  0       
2 Width_in      0.0269   0.00373      7.22 6.58e-13
3 Height_in    -0.0133   0.00395     -3.36 7.93e- 4


\[\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height\]

❓ How do we interpret this model?
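One way to ground the interpretation is to compare predictions for two otherwise-identical paintings that differ by one inch in width. This is a sketch, not part of the original slides; `new_pp` is a made-up data frame, and the code assumes `pp_fit` from above and that tidymodels is loaded.

```r
# Two hypothetical paintings differing only in width (height held constant)
new_pp <- tibble(Width_in = c(20, 21), Height_in = c(15, 15))

# predict() on a parsnip fit returns a tibble with a .pred column
predict(pp_fit, new_data = new_pp)

# Because the model is linear, the two predictions differ by exactly the
# Width_in coefficient (~0.0269): one extra inch of width is associated
# with a 0.0269 increase in log price, i.e. a multiplicative factor of
# exp(0.0269) ~ 1.027 on the price scale.
```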

Numerical and categorical predictors

Price, surface area, and living artist

  • Explore the relationship between price of paintings and surface area, conditioned on whether or not the artist is still living
  • First visualize and explore, then model
  • But first, prep the data:
pp <- pp |>
  mutate(artistliving = case_when(artistliving == 0 ~ "Deceased", 
                                  TRUE ~ "Living"))

pp |>
  count(artistliving)
# A tibble: 2 × 2
  artistliving     n
  <chr>        <int>
1 Deceased      2937
2 Living         456

Typical surface area

Typical surface area appears to be less than 1000 square inches (~ 80cm x 80cm). There are very few paintings that have surface area above 5000 square inches.

ggplot(data = pp, aes(x = Surface, fill = artistliving)) +
  geom_histogram(binwidth = 500) + 
  facet_grid(artistliving ~ .) +
  scale_fill_manual(values = c("#E48957", "#071381")) +
  guides(fill = "none") +
  labs(x = "Surface area", y = NULL) +
  geom_vline(xintercept = 1000) +
  geom_vline(xintercept = 5000, linetype = "dashed", color = "gray")

Narrowing the scope

For simplicity let’s focus on the paintings with Surface < 5000:

pp_Surf_lt_5000 <- pp |>
  filter(Surface < 5000)

ggplot(data = pp_Surf_lt_5000, 
       aes(y = log_price, x = Surface, color = artistliving, shape = artistliving)) +
  geom_point(alpha = 0.5) +
  labs(color = "Artist", shape = "Artist") +
  scale_color_manual(values = c("#E48957", "#071381"))

Facet to get a better look

ggplot(data = pp_Surf_lt_5000, 
       aes(y = log_price, x = Surface, color = artistliving, shape = artistliving)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~artistliving) +
  scale_color_manual(values = c("#E48957", "#071381")) +
  labs(color = "Artist", shape = "Artist")

Two ways to model

  • Main effects: Assuming relationship between surface and logged price does not vary by whether or not the artist is living.
  • Interaction effects: Assuming relationship between surface and logged price varies by whether or not the artist is living.

Interacting explanatory variables

  • Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
  • This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.
  • This can be accomplished by adding an interaction variable: the product of two explanatory variables.
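In R's formula syntax, the `*` operator is shorthand for main effects plus their interaction, so you rarely need to construct the product column by hand. As a sketch (using `lin_mod` and `pp_Surf_lt_5000` defined above; `pp_int_fit_explicit` is just an illustrative name):

```r
# log_price ~ Surface * artistliving  expands to
# log_price ~ Surface + artistliving + Surface:artistliving,
# where Surface:artistliving is the interaction (product) term
pp_int_fit_explicit <- lin_mod |>
  fit(log_price ~ Surface + artistliving + Surface:artistliving,
      data = pp_Surf_lt_5000)

# This fit is identical to fit(log_price ~ Surface * artistliving, ...)
tidy(pp_int_fit_explicit)
```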

Two ways to model

  • Main effects: Assuming relationship between surface and logged price does not vary by whether or not the artist is living
  • Interaction effects: Assuming relationship between surface and logged price varies by whether or not the artist is living

❓ Which does your intuition/knowledge of the data suggest is more appropriate?

Put a green sticky if you think main; pink if you think interaction.

Fit model with main effects

  • Response variable: log_price
  • Explanatory variables: Surface area and artistliving
pp_main_fit <- lin_mod |>
  fit(log_price ~ Surface + artistliving, data = pp_Surf_lt_5000)
tidy(pp_main_fit)
# A tibble: 3 × 5
  term               estimate std.error statistic  p.value
  <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        4.88     0.0424       115.   0       
2 Surface            0.000265 0.0000415      6.39 1.85e-10
3 artistlivingLiving 0.137    0.0970         1.41 1.57e- 1

\[\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times artistliving\]

Solving the model

  • Non-living artist: Plug in 0 for artistliving

\(\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times 0\)
\(= 4.88 + 0.000265 \times surface\)

  • Living artist: Plug in 1 for artistliving

\(\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times 1\)
\(= 5.017 + 0.000265 \times surface\)

Visualizing main effects

  • Same slope: Rate of change in price as the surface area increases does not vary between paintings by living and non-living artists.
  • Different intercept: Paintings by living artists are consistently more expensive than paintings by non-living artists.

Interpreting main effects

tidy(pp_main_fit) |> 
  mutate(exp_estimate = exp(estimate)) |>
  select(term, estimate, exp_estimate)
# A tibble: 3 × 3
  term               estimate exp_estimate
  <chr>                 <dbl>        <dbl>
1 (Intercept)        4.88           132.  
2 Surface            0.000265         1.00
3 artistlivingLiving 0.137            1.15
  • All else held constant, for each additional square inch in a painting’s surface area, the price of the painting is predicted, on average, to be higher by a factor of exp(0.000265) ≈ 1.0003 (rounded to 1.00 in the output above), i.e. essentially unchanged per single square inch.
  • All else held constant, paintings by a living artist are predicted, on average, to be higher by a factor of 1.15 compared to paintings by an artist who is no longer alive.
  • Paintings that are by an artist who is not alive and that have a surface area of 0 square inches are predicted, on average, to be 132 livres.

Main vs. interaction effects

  • The way we specified our main effects model only lets artistliving affect the intercept.
  • Model implicitly assumes that paintings with living and deceased artists have the same slope and only allows for different intercepts.

❓ What seems more appropriate in this case?

  • Same slope and same intercept for both groups
  • Same slope but different intercepts for the two groups
  • Different slopes and different intercepts for the two groups

Interaction: Surface * artistliving

Fit model with interaction effects

  • Response variable: log_price
  • Explanatory variables: Surface area, artistliving, and their interaction
pp_int_fit <- lin_mod |>
  fit(log_price ~ Surface * artistliving, data = pp_Surf_lt_5000)
tidy(pp_int_fit)
# A tibble: 4 × 5
  term                        estimate std.error statistic    p.value
  <chr>                          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)                 4.91     0.0432       114.   0         
2 Surface                     0.000206 0.0000442      4.65 0.00000337
3 artistlivingLiving         -0.126    0.119         -1.06 0.289     
4 Surface:artistlivingLiving  0.000479 0.000126       3.81 0.000139  

Linear model with interaction effects

# A tibble: 4 × 5
  term                        estimate std.error statistic    p.value
  <chr>                          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)                 4.91     0.0432       114.   0         
2 Surface                     0.000206 0.0000442      4.65 0.00000337
3 artistlivingLiving         -0.126    0.119         -1.06 0.289     
4 Surface:artistlivingLiving  0.000479 0.000126       3.81 0.000139  

\[\widehat{log\_price} = 4.91 + 0.00021 \times surface - 0.126 \times artistliving\] \[+ ~ 0.00048 \times surface \times artistliving\]

Interpretation of interaction effects

  • Rate of change in price as the surface area of the painting increases does vary between paintings by living and non-living artists (different slopes)
  • Some paintings by living artists are more expensive than paintings by non-living artists, and some are not (different intercept).
  • Non-living artist: \(\widehat{log\_price} = 4.91 + 0.00021 \times surface\) \(- 0.126 \times 0 + 0.00048 \times surface \times 0\) \(= 4.91 + 0.00021 \times surface\)
  • Living artist: \(\widehat{log\_price} = 4.91 + 0.00021 \times surface\) \(- 0.126 \times 1 + 0.00048 \times surface \times 1\) \(= 4.91 + 0.00021 \times surface\) \(- 0.126 + 0.00048 \times surface\) \(= 4.784 + 0.00069 \times surface\)

Comparing models

R-squared

  • \(R^2\) is the proportion of variability in the response variable explained by the regression model.
glance(pp_main_fit)$r.squared
[1] 0.01320884
glance(pp_int_fit)$r.squared
[1] 0.0176922
  • Clearly the model with interactions has a higher \(R^2\).
  • However using \(R^2\) for model selection in models with multiple explanatory variables is not a good idea as \(R^2\) increases when any variable is added to the model.
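This can be demonstrated directly: adding a predictor of pure random noise still nudges \(R^2\) up, while adjusted \(R^2\) penalizes the useless variable. A sketch, assuming the objects defined above; `noise` is a fabricated variable used only for illustration:

```r
# Add a pure-noise column to the data
set.seed(1)
pp_noise <- pp_Surf_lt_5000 |>
  mutate(noise = rnorm(n()))

pp_noise_fit <- lin_mod |>
  fit(log_price ~ Surface + artistliving + noise, data = pp_noise)

# R-squared can only increase (or stay equal) when a variable is added...
glance(pp_noise_fit)$r.squared
# ...while adjusted R-squared will typically drop for an uninformative one
glance(pp_noise_fit)$adj.r.squared
```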

Adjusted R-squared

It appears that adding the interaction actually increased adjusted \(R^2\), so we should indeed use the model with the interactions.

glance(pp_main_fit)$adj.r.squared
[1] 0.01258977
glance(pp_int_fit)$adj.r.squared
[1] 0.01676753

Third order interactions

  • Can you? Yes
  • Should you? Probably not if you want to interpret these interactions in context of the data.

In pursuit of Occam’s razor

  • Occam’s Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.

  • Model selection follows this principle.

  • We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.

  • In other words, we prefer the simplest best model, i.e. parsimonious model.

Backward selection

For this demo, we’ll ignore interaction effects…and just model main effects to start:

pp_full <-  lin_mod |>
  fit(log_price ~ Width_in + Height_in + Surface + artistliving, data=pp) 

glance(pp_full)$adj.r.squared
[1] 0.02570141
  • Adjusted \(R^2\) (full): 0.0257

Remove artistliving

pp_noartist <- lin_mod |>
  fit(log_price ~ Width_in + Height_in + Surface, data=pp) 

glance(pp_noartist)$adj.r.squared
[1] 0.02579859
  • Adjusted \(R^2\) (full): 0.0257
  • Adjusted \(R^2\) (no artistliving): 0.0258

…Improved variance explained

Remove Surface

pp_noartist_nosurface <- lin_mod |>
  fit(log_price ~ Width_in + Height_in, data=pp) 

glance(pp_noartist_nosurface)$adj.r.squared
[1] 0.02231559
  • Adjusted \(R^2\) (full): 0.0257
  • Adjusted \(R^2\) (no artistliving): 0.0258
  • Adjusted \(R^2\) (no artistliving or Surface): 0.0223

…no longer gaining improvement, so we stick with: log_price ~ Width_in + Height_in + Surface

Other approach:

# requires package installation: 
# install.packages("olsrr")
library(olsrr)

Step 1: Fit model (w/o tidymodels)

# fit the model (not using tidymodels)
mod <- lm(log_price ~ Width_in + Height_in + Surface + artistliving, data=pp_Surf_lt_5000)

Step 2: Determine which variables to remove

ols_step_backward_p(mod)

                             Elimination Summary                               
------------------------------------------------------------------------------
        Variable                      Adj.                                        
Step      Removed       R-Square    R-Square     C(p)        AIC         RMSE     
------------------------------------------------------------------------------
   1    artistliving      0.0261      0.0251    3.8495    12603.7727    1.8315    
------------------------------------------------------------------------------

…specifies that artistliving should be removed

Step 2 (alternate): Compare all possible models…

ols_step_all_possible(mod) |>
  arrange(desc(adjr))
   Index N                              Predictors     R-Square Adj. R-Square
1     11 3              Width_in Height_in Surface 0.0260749939  0.0251349118
2     15 4 Width_in Height_in Surface artistliving 0.0263412027  0.0250876993
3      5 2                      Width_in Height_in 0.0256902566  0.0250634893
4     12 3         Width_in Height_in artistliving 0.0259732581  0.0250330779
5      6 2                        Width_in Surface 0.0249136264  0.0242863596
6     13 3           Width_in Surface artistliving 0.0251787948  0.0242378477
7      7 2                   Width_in artistliving 0.0212864021  0.0206568018
8      1 1                                Width_in 0.0209415833  0.0206267736
9      8 2                    Surface artistliving 0.0132088377  0.0125897717
10     2 1                                 Surface 0.0125899681  0.0122803381
11    14 3          Height_in Surface artistliving 0.0130836930  0.0121310711
12     9 2                       Height_in Surface 0.0126782901  0.0120431523
13    10 2                  Height_in artistliving 0.0062698155  0.0056305552
14     3 1                               Height_in 0.0058797727  0.0055601199
15     4 1                            artistliving 0.0005531617  0.0002397573
   Mallow's Cp
1     3.849487
2     5.000000
3     3.077206
4     4.174132
5     5.555476
6     6.709309
7    17.130153
8    16.230489
9    51.971190
10   52.001268
11   45.305459
12   44.599123
13   65.048926
14   64.293574
15   91.485605

Recap

  • Can you model and interpret linear models with multiple predictors?
  • Can you explain the difference in a model with main effects vs. interaction effects?
  • Can you compare different models and determine how to proceed?
  • Can you carry out and explain backward selection?

Suggested Reading

Introduction to Modern Statistics Chapter 8: Linear Regression with Multiple Predictors